STA 9750 Final Project

Introduction

Many factors, such as user reviews, the budget of the movie, and the popularity of its actors and director, play a role in making a movie successful. IMDB is the world’s most popular and authoritative source designed to help people decide what to watch. IMDB therefore plays an important role in measuring the success of a movie by rating it on various factors. This project focuses on finding out which factor impacts the IMDB rating of a movie the most.

We fetched our data from the IMDB 5000 movie dataset (https://www.kaggle.com/suchitgupta60/IMDB-data), which consists of 5043 movies across 100 years from 66 countries. The data holds 28 variables such as Director, Actors, Duration, Gross, Budget, Genres, Facebook Likes, etc.

We use several modeling techniques, with associated visualizations, to identify the most important variable that impacts the success and rating of a movie.

Data Exploration

After cleaning the data, we narrowed the scope of our model to 26 variables and 3806 rows. We removed features such as aspect ratio, IMDB movie link, and color because they reduced the quality of our data and were less important to our analysis. We also calculated the net profit and ROI of each movie and added them to the factors that may impact the rating of a movie. To simplify the data further, we grouped the countries into three categories: USA, UK, and ‘others’ for all remaining countries. We also replaced all content_rating values with their equivalents in the modern rating system.
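
The derived columns described above could be computed along the following lines. This is a minimal sketch rather than the project's actual cleaning code: the toy rows are made up, the column names `gross`, `budget`, and `country` are assumed to match the Kaggle file, and ROI is assumed to mean profit divided by budget.

```r
# Toy rows standing in for the cleaned data; all values are made up.
movies <- data.frame(
  movie_title = c("A", "B", "C", "D"),
  country     = c("USA", "UK", "France", "Japan"),
  budget      = c(100e6, 20e6, 5e6, 10e6),
  gross       = c(250e6, 15e6, 9e6, 10e6),
  stringsAsFactors = FALSE
)

# Net profit: gross revenue minus budget.
movies$profit <- movies$gross - movies$budget

# ROI in percent, assumed here to be profit relative to budget.
movies$roi_pct <- 100 * movies$profit / movies$budget

# Collapse every country other than USA and UK into "Others".
movies$country_group <- ifelse(movies$country %in% c("USA", "UK"),
                               movies$country, "Others")

print(movies[, c("movie_title", "profit", "roi_pct", "country_group")])
```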

IMDB offers a grading scale that allows users to rate films on a scale of one to ten. IMDB indicates that submitted scores are filtered and weighted in various ways to produce the weighted mean that is displayed for each movie.

Movies with IMDB ratings above 7.5 are considered highly recommended. As the distribution below shows, the most common rating among these movies is 7.6, with only a handful rated above 9. The highest rating received by a movie is 9.3.

The majority of the movies fall in the range of 6.5 to 7.7, which is considered an average IMDB score. The histogram closely fits a normal distribution. However, only a small number of phenomenal movies are rated above 8.

The table below is filtered to IMDB scores greater than 7.5 and arranged in descending order of movie count. The most common score in this range is 7.6. Once the IMDB score rises above 8.8, the number of movies per score drops to fewer than 5. Only 0.21% of the movies are rated above 8.8, which is also visible in the histogram shown above.

## # A tibble: 17 x 2
## # Groups:   imdb_score [17]
##    imdb_score     n
##         <dbl> <int>
##  1        7.6   100
##  2        7.7    90
##  3        7.8    83
##  4        8      55
##  5        7.9    52
##  6        8.1    48
##  7        8.2    24
##  8        8.3    24
##  9        8.5    19
## 10        8.4    15
## 11        8.6     8
## 12        8.7     7
## 13        8.8     5
## 14        8.9     4
## 15        9       2
## 16        9.2     1
## 17        9.3     1
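
The 0.21% figure can be checked directly from the table. The short calculation below copies the counts for scores above 8.8 from the table and uses the 3806 rows stated for the cleaned data; the variable names are illustrative only.

```r
# Counts per score for scores above 8.8, copied from the table above.
counts_above_8_8 <- c(`8.9` = 4, `9` = 2, `9.2` = 1, `9.3` = 1)

# Number of rows in the cleaned data set, as stated in the text.
n_total <- 3806

n_above   <- sum(counts_above_8_8)   # movies rated above 8.8
share_pct <- 100 * n_above / n_total # their share of all movies, in percent

print(n_above)
print(round(share_pct, 2))
```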

Impact of content rating on IMDB score

The average IMDB score is 6.46, which is considered a poor score. The ‘R’ content rating has the highest count, 1809 movies, which may be why ‘R’ has the highest IMDB rating compared to the others. However, PG-13 has the second highest count, 1314 movies, with an average IMDB score of less than 6.3. Based on this distribution, we conclude that content rating does not have a strong impact on the IMDB score of a movie.
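
A per-rating summary of this kind (movie count and mean score for each content rating) might be produced as sketched here in base R. The rows are invented for illustration, and the `summarise()` message in the output suggests the original used dplyr, so this is an equivalent sketch, not the report's code.

```r
# Toy data; ratings and scores are made up for illustration.
movies <- data.frame(
  content_rating = c("R", "R", "PG-13", "PG-13", "PG", "G"),
  imdb_score     = c(7.1, 6.5, 6.0, 6.4, 6.8, 7.0),
  stringsAsFactors = FALSE
)

# Count of movies and mean IMDB score per content rating.
n_movies   <- table(movies$content_rating)
mean_score <- tapply(movies$imdb_score, movies$content_rating, mean)

summary_tbl <- data.frame(
  content_rating = names(mean_score),
  n              = as.vector(n_movies[names(mean_score)]),
  mean_score     = as.vector(mean_score)
)

# Largest groups first, as in the discussion above.
summary_tbl <- summary_tbl[order(-summary_tbl$n), ]
print(summary_tbl)
```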

Understanding the distribution of directors and their effect on the IMDB score

We grouped the directors by the number of movies they directed. We then filtered the data to show only directors with more than 10 and fewer than 50 movies, to remove anomalies in the data. Logically, directors with more movies could have a larger fan following, more credibility, and a higher success rate, possibly leading to a higher IMDB score.

According to the distribution shown below, even after filtering, most directors have between 10 and 15 movies, a few have between 15 and 20, and the remaining two are outliers. This indicates that, in this time frame, the typical output of a director is in the 10 to 15 range. The rationale could be budget, resource, or time constraints.
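
The grouping and filtering step could look roughly like the sketch below. The director names and scores are hypothetical, and `director_name` and `imdb_score` are assumed to be the relevant column names.

```r
# Toy data; director names and scores are hypothetical.
movies <- data.frame(
  director_name = c(rep("Director A", 12), rep("Director B", 3),
                    rep("Director C", 16)),
  imdb_score    = c(rep(7.0, 12), rep(6.0, 3), rep(6.5, 16)),
  stringsAsFactors = FALSE
)

# Number of movies per director.
n_movies <- table(movies$director_name)

# Keep only directors with more than 10 and fewer than 50 movies.
keep <- names(n_movies)[n_movies > 10 & n_movies < 50]
kept <- movies[movies$director_name %in% keep, ]

# Average IMDB score per remaining director.
avg_score <- tapply(kept$imdb_score, kept$director_name, mean)

result <- data.frame(
  director_name = names(avg_score),
  n_movies      = as.vector(n_movies[names(avg_score)]),
  avg_score     = as.vector(avg_score)
)
print(result)
```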

The chart below shows the average IMDB score for directors with 15 or more movies. Only a few directors in this data set have directed that many. The average IMDB score is above 5.5 for all of these directors. Steven Spielberg is the only director with roughly 24 movies. Most of the directors here have a relatively high IMDB score, which suggests that the number of movies directed has a slight impact on the IMDB score.

Top 20 movies by IMDB score

The scatter plot below shows the 20 movies with the highest IMDB scores. Most of the directors have more than one movie rated above 7.5, which is considered a good score. These movies have received more user reviews than other movies in the data set: each of the top 20 has at least 1000 user reviews, which is significantly higher than the median of 205 user reviews.

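Summary of the number of user reviews (minimum, first quartile, median, mean, third quartile, maximum): 1, 105, 205, 330.0793484, 392.75, 5060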

Impact of country on the IMDB score

While cleaning the data, we grouped all countries other than the U.S. and the U.K. into an ‘other’ category, as each of them contributed far fewer movies than the U.S. and the U.K. As the scatter plot below shows, the highest number of reviewed movies comes from the U.S., followed by the U.K. We can also see higher IMDB ratings in the U.S., which has the highest number of user reviews. The pattern of higher scores going together with more user reviews repeats in the plot below as well.

Impact of movie duration on the IMDB score

The scatter plot below shows a positive, roughly linear relationship between IMDB score and duration. As the duration increases, the IMDB score also increases. Most movies with a score higher than 7.5 have a longer duration.

Impact of net profit on IMDB score

Movies with a net profit above 200 million have higher IMDB ratings. The trend below shows that higher net profit translates into a higher rating. We can therefore assume that movies with higher net profits have higher viewership and thus receive higher ratings.

Conversely, one might expect movies with higher IMDB scores to generate higher net profits, but this is not always the case. Many movies have very good IMDB scores but did not generate much profit. The IMDB score therefore cannot be the sole factor in explaining net profit.

Modeling techniques to identify the most important variables that impact IMDB ratings of the movie

We divide the data set into two parts: 80% of the data as training data and the remaining 20% as testing data.
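
One simple way to make such a split is a random sample of row indices, sketched below on toy data. The seed and the data frame `IMDB` are assumptions for illustration; `IMDB_train` and `IMDB_test` follow the names that appear in the model output.

```r
set.seed(9750)  # arbitrary seed, used here only so the split is reproducible

# Toy stand-in for the cleaned data set; scores are random.
IMDB <- data.frame(id = 1:100,
                   imdb_score = round(runif(100, 1, 10), 1))

# Draw 80% of the row indices at random for training.
train_idx  <- sample(nrow(IMDB), size = floor(0.8 * nrow(IMDB)))
IMDB_train <- IMDB[train_idx, ]   # 80% training rows
IMDB_test  <- IMDB[-train_idx, ]  # remaining 20% for testing

print(c(train = nrow(IMDB_train), test = nrow(IMDB_test)))
```

Applied to 3806 rows, `floor(0.8 * 3806)` gives 3044 training rows, which is consistent with the 3039 residual degrees of freedom plus 5 coefficients reported in the linear model output.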

Linear Model

The linear model shown below indicates that the number of voted users, the duration, and the number of critic reviews impact the IMDB score the most. The R-squared of about 0.28 is low, which suggests that a linear relationship with these variables captures the IMDB score poorly.

The low R-squared value indicates that the independent variables (duration, num_voted_users, num_critic_for_reviews, and movie_facebook_likes) do not explain much of the variation in the dependent variable, imdb_score. In other words, even the statistically significant predictors account for only a small part of that variation.
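
A fit of this form, together with a test-set root mean squared error, could be reproduced roughly as follows. The data here are simulated (the coefficients and distributions are invented), so the numbers will not match the report; only the formula and the object names `IMDB_train` and `IMDB_test` are taken from the output. The single value printed after the model summary in the report is presumably a test-set RMSE of this kind.

```r
set.seed(1)  # arbitrary seed for a reproducible toy example
n <- 200

# Simulated predictors with the same names as in the report.
IMDB <- data.frame(
  duration               = rnorm(n, mean = 110, sd = 20),
  num_voted_users        = rexp(n, rate = 1 / 1e5),
  num_critic_for_reviews = rpois(n, lambda = 150),
  movie_facebook_likes   = rexp(n, rate = 1 / 1e4)
)
# Simulated response; the coefficients are invented for illustration.
IMDB$imdb_score <- 5 + 0.01 * IMDB$duration +
  2.5e-6 * IMDB$num_voted_users + rnorm(n, sd = 0.9)

# 80/20 train/test split.
train_idx  <- sample(n, size = floor(0.8 * n))
IMDB_train <- IMDB[train_idx, ]
IMDB_test  <- IMDB[-train_idx, ]

# Same formula as in the model output.
fit <- lm(imdb_score ~ duration + num_voted_users +
            num_critic_for_reviews + movie_facebook_likes,
          data = IMDB_train)

r2   <- summary(fit)$r.squared              # share of variance explained
pred <- predict(fit, newdata = IMDB_test)   # predictions on held-out rows
rmse <- sqrt(mean((pred - IMDB_test$imdb_score)^2))

print(c(r_squared = r2, test_rmse = rmse))
```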

## 
## Call:
## lm(formula = imdb_score ~ duration + num_voted_users + num_critic_for_reviews + 
##     movie_facebook_likes, data = IMDB_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.2440 -0.5105  0.0826  0.6210  2.4911 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            4.948e+00  8.187e-02  60.438   <2e-16 ***
## duration               1.052e-02  7.439e-04  14.143   <2e-16 ***
## num_voted_users        2.479e-06  1.381e-07  17.948   <2e-16 ***
## num_critic_for_reviews 4.449e-04  1.930e-04   2.305   0.0212 *  
## movie_facebook_likes   2.084e-06  1.116e-06   1.867   0.0620 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8839 on 3039 degrees of freedom
## Multiple R-squared:  0.2876, Adjusted R-squared:  0.2867 
## F-statistic: 306.7 on 4 and 3039 DF,  p-value: < 2.2e-16
## [1] 0.9382928

Random Forest to determine the variable that has the most impact on the IMDB score

Next, we run a random forest model on the training data to identify the most important variable. The random forest includes all the variables from the data set. The variables are plotted below by importance, which shows that the number of voted users impacts the IMDB score the most.
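
A regression forest with variable importance might be set up as in the sketch below, assuming the `randomForest` package from CRAN is installed. The training data are simulated with only three predictors, and the settings (`ntree = 500`, `do.trace = 50`) are guesses chosen to mirror the shape of the trace output that follows; they are not confirmed by the report.

```r
library(randomForest)  # CRAN package; assumed to be installed

set.seed(1)  # arbitrary seed for a reproducible toy example
n <- 300

# Simulated training data; values and effect sizes are invented.
IMDB_train <- data.frame(
  duration        = rnorm(n, mean = 110, sd = 20),
  num_voted_users = rexp(n, rate = 1 / 1e5),
  budget          = rexp(n, rate = 1 / 4e7)
)
IMDB_train$imdb_score <- 5 + 0.01 * IMDB_train$duration +
  5e-6 * IMDB_train$num_voted_users + rnorm(n, sd = 0.7)

# Fit a regression forest on all predictors; do.trace prints the
# out-of-bag MSE every 50 trees.
rf <- randomForest(imdb_score ~ ., data = IMDB_train,
                   ntree = 500, importance = TRUE, do.trace = 50)

# Permutation importance (%IncMSE), most important variable first.
imp <- importance(rf, type = 1)
print(imp[order(-imp[, 1]), , drop = FALSE])
```

Calling `varImpPlot(rf)` on the fitted object draws an importance plot of the kind described above.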

##      |      Out-of-bag   |
## Tree |      MSE  %Var(y) |
##   50 |   0.5104    46.62 |
##  100 |   0.4887    44.64 |
##  150 |    0.483    44.12 |
##  200 |    0.479    43.75 |
##  250 |   0.4782    43.68 |
##  300 |   0.4774    43.61 |
##  350 |   0.4764    43.51 |
##  400 |   0.4762    43.50 |
##  450 |   0.4758    43.46 |
##  500 |   0.4746    43.35 |

The root mean squared error for the above random forest is 0.7609269, making it an average model.

Random Forest with select variables to reduce the Mean Squared Error

The mean squared error of the model below, computed as mean((predict.IMDB.rf - IMDB_test$imdb_score)^2), is lower than that of the previous model, computed as mean((predicted.rf - IMDB_test$imdb_score)^2). The output below reports the root mean squared error followed by the mean squared error. Because this model uses only the most important variables from the top of the importance ranking, it can yield a lower root mean squared error.

## [1] 0.7450838
## [1] 0.5551498
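
The two values above are consistent with being the root mean squared error and the mean squared error of the same model, since the first squared equals the second. The check below assumes that reading; the helper `rmse()` is illustrative and not from the report, and the comparison value 0.7609269 is the RMSE quoted for the all-variable forest.

```r
# Illustrative helper: root mean squared error of predictions.
rmse <- function(predicted, actual) sqrt(mean((predicted - actual)^2))

# Values copied from the report.
rmse_selected <- 0.7450838  # first value printed for the selected-variable model
mse_selected  <- 0.5551498  # second value printed for the same model
rmse_full     <- 0.7609269  # RMSE quoted for the all-variable forest

# Squaring the RMSE should recover the MSE (up to rounding).
print(rmse_selected^2)
print(abs(rmse_selected^2 - mse_selected) < 1e-6)

# Implied MSE of the all-variable forest, for comparison.
mse_full <- rmse_full^2
print(c(mse_full = mse_full, mse_selected = mse_selected))
```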

We can see that the most important variable is the number of voted users. The reason is fairly clear: a rating is only generated when people vote on or review a movie. The second most important factor is the duration of a movie. This is interesting because duration is not an obvious candidate. One possible explanation is that longer movies are generally high-budget productions with a popular cast, so their quality is usually better. The third factor is Facebook likes. Although this factor is difficult to predict, we can reason that people like something on Facebook only if they truly enjoy it, so Facebook likes could also impact IMDB ratings.

The next three variables, budget, genres, and number of user reviews, are very close in importance. High-budget movies tend to get high IMDB scores because they are usually released with a lot of hype and promotion. Genres also have an impact because some genres are more attractive to users than others; typically, action and thriller movies are preferred by many viewers. The number of user reviews is also important, as these users directly rate the movies on the IMDB website.

Conclusion

The random forest took into consideration all the variables from the data set to understand their impact on the IMDB score. We therefore conclude that the number of voted users is the most important variable for a high IMDB score, followed by duration and the Facebook likes received from the audience. It is surprising that the names of actors and directors were among the least important factors, as one would expect directors and actors to bring in publicity and thus high viewership.